Objectives of the project:

Data cleaning and processing

setwd('/Users/Gali/Desktop/DataScienceTechInstitute/17. Big data with R/')
getwd()
## [1] "C:/Users/Gali/Desktop/DataScienceTechInstitute/17. Big data with R"

Load the provided data set AirBnB (1).Rdata and inspect its contents.

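# load() restores the objects saved in the file; here it provides the data frame L
load(file='C:/Users/Gali/Desktop/DataScienceTechInstitute/17. Big data with R/AirBnB (1).Rdata')
# Work on a copy so the original object L stays unchanged
test2 <- L
# Show the first six rows of every column
head(test2)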
##         id                           listing_url   scrape_id last_scraped
## 1  4867396  https://www.airbnb.com/rooms/4867396 2.01607e+13   2016-07-03
## 2  7704653  https://www.airbnb.com/rooms/7704653 2.01607e+13   2016-07-04
## 3  2725029  https://www.airbnb.com/rooms/2725029 2.01607e+13   2016-07-04
## 4  9337509  https://www.airbnb.com/rooms/9337509 2.01607e+13   2016-07-03
## 5 12928158 https://www.airbnb.com/rooms/12928158 2.01607e+13   2016-07-04
## 6  5589471  https://www.airbnb.com/rooms/5589471 2.01607e+13   2016-07-04
##                                        name
## 1       Appartement 60m2 Rue Legendre 75017
## 2       Appart au pied de l'arc de triomphe
## 3            Nice appartment in Batignolles
## 4            Charming flat near Batignolles
## 5 Spacious bedroom near the centre of Paris
## 6           Rare, Maison individuelle 200m2
##                                                                                                                                                                                                                                                  summary
## 1      Au 2ème étage d'un bel immeuble joli 2 pièces meublé comprenant: une grande pièce à vivre lumineuse, une chambre, une cuisine, salle de douche et WC séparé. Appartement très calme et lumineux. A proximité de nombreux commerces et transports.
## 2 Nous proposons cette appartement situé en plein coeur de Paris, au pied de l'arc de triomphe. Commerçants, métro, cinéma, vous trouverez à proximité tout ce qu'il faut pour passer quelques jours à Paris en amoureux, entre copains ou en famille ! 
## 3                                                                                                                            Located in the very charming Batignolles, this cozy and bright two-room appartment will perfectly suit your stay in Paris. 
## 4                                          Welcome to my apartment ! This a quiet and cosy flat with 2 room (25 sqm2) fully furnished closed to trendy Batignolles area in the heart of the 17th district. (Near Montmartre foothill / Place de Clichy).
## 5                                                                                                                                                                                            Spacious, quiet and bright room, ideal to explore and enjoy
## 6                                                         Maison individuelle, 200 m2 habitable,rénovée en 2013. Quartier résidentiel, nombreux commerces, restaurants.  Maison familiale, pouvant accueillir 5 adultes et un enfant (1 lit en hauteur).
##                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             space
## 1                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                
## 2 L'appartement est composé de : - une grande chambre (environ 15m2) avec un lit simple et d'un matelas d'appoint - une salle de bain avec douche, lave linge/sèche linge - un autre chambre (environ 10m2) avec un lit double (lit gigogne) et une salle de bain dans la chambre (douche) - un grand salon avec une cuisine ouverte (environ 35 m2) - wc séparé Le cuisine est tout équipé : machine nespresso, cocotte-minute, mixeur, lave vaisselle... L'appartement est très lumineux puisqu'il donne sur une avenue large mais calme. Vous trouverez à proximité plein de commercants, de bar pour sortir, de restaurants, des cinémas, des musées. Vous serez au coeur de la ville !  N'hésitez pas à nous contacter pour plus d'information, de photos...
## 3                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                
## 4                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                
## 5                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                
## 6                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                
##                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             description
## 1                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     Au 2ème étage d'un bel immeuble joli 2 pièces meublé comprenant: une grande pièce à vivre lumineuse, une chambre, une cuisine, salle de douche et WC séparé. Appartement très calme et lumineux. A proximité de nombreux commerces et transports.
## 2 Nous proposons cette appartement situé en plein coeur de Paris, au pied de l'arc de triomphe. Commerçants, métro, cinéma, vous trouverez à proximité tout ce qu'il faut pour passer quelques jours à Paris en amoureux, entre copains ou en famille ! L'appartement est composé de : - une grande chambre (environ 15m2) avec un lit simple et d'un matelas d'appoint - une salle de bain avec douche, lave linge/sèche linge - un autre chambre (environ 10m2) avec un lit double (lit gigogne) et une salle de bain dans la chambre (douche) - un grand salon avec une cuisine ouverte (environ 35 m2) - wc séparé Le cuisine est tout équipé : machine nespresso, cocotte-minute, mixeur, lave vaisselle... L'appartement est très lumineux puisqu'il donne sur une avenue large mais calme. Vous trouverez à proximité plein de commercants, de bar pour sortir, de restaurants, des cinémas, des musées. Vous serez au coeur de la ville !  N'hésitez pas à nous contacter pour plus d'information, de photos...
## 3                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            Located in the very charming Batignolles, this cozy and bright two-room appartment will perfectly suit your stay in Paris.
## 4                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         Welcome to my apartment ! This a quiet and cosy flat with 2 room (25 sqm2) fully furnished closed to trendy Batignolles area in the heart of the 17th district. (Near Montmartre foothill / Place de Clichy).
## 5                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           Spacious, quiet and bright room, ideal to explore and enjoy
## 6                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        Maison individuelle, 200 m2 habitable,rénovée en 2013. Quartier résidentiel, nombreux commerces, restaurants.  Maison familiale, pouvant accueillir 5 adultes et un enfant (1 lit en hauteur).
##   experiences_offered neighborhood_overview notes transit access interaction
## 1                none                                                       
## 2                none                                                       
## 3                none                                                       
## 4                none                                                       
## 5                none                                                       
## 6                none                                                       
##   house_rules
## 1            
## 2            
## 3            
## 4            
## 5            
## 6            
##                                                                                   thumbnail_url
## 1                                                                                              
## 2           https://a1.muscache.com/im/pictures/97911969/ef37b496_original.jpg?aki_policy=small
## 3                                                                                              
## 4                                                                                              
## 5 https://a2.muscache.com/im/pictures/df47511b-0e86-4dcb-9887-569489b16020.jpg?aki_policy=small
## 6                                                                                              
##                                                                                       medium_url
## 1                                                                                               
## 2           https://a1.muscache.com/im/pictures/97911969/ef37b496_original.jpg?aki_policy=medium
## 3                                                                                               
## 4                                                                                               
## 5 https://a2.muscache.com/im/pictures/df47511b-0e86-4dcb-9887-569489b16020.jpg?aki_policy=medium
## 6                                                                                               
##                                                                                     picture_url
## 1           https://a1.muscache.com/im/pictures/61090424/02c8a8bb_original.jpg?aki_policy=large
## 2           https://a1.muscache.com/im/pictures/97911969/ef37b496_original.jpg?aki_policy=large
## 3           https://a1.muscache.com/im/pictures/96821426/ea9864f1_original.jpg?aki_policy=large
## 4 https://a2.muscache.com/im/pictures/5fa65f2d-b159-4fb5-986a-bd36cb92d2bc.jpg?aki_policy=large
## 5 https://a2.muscache.com/im/pictures/df47511b-0e86-4dcb-9887-569489b16020.jpg?aki_policy=large
## 6           https://a2.muscache.com/im/pictures/69589240/79d976c4_original.jpg?aki_policy=large
##                                                                                    xl_picture_url
## 1                                                                                                
## 2           https://a1.muscache.com/im/pictures/97911969/ef37b496_original.jpg?aki_policy=x_large
## 3                                                                                                
## 4                                                                                                
## 5 https://a2.muscache.com/im/pictures/df47511b-0e86-4dcb-9887-569489b16020.jpg?aki_policy=x_large
## 6                                                                                                
##    host_id                                   host_url host_name host_since
## 1  9703910  https://www.airbnb.com/users/show/9703910  Matthieu 2013-10-29
## 2 35777602 https://www.airbnb.com/users/show/35777602    Claire 2015-06-14
## 3 13945253 https://www.airbnb.com/users/show/13945253   Vincent 2014-04-06
## 4  5107123  https://www.airbnb.com/users/show/5107123     Julie 2013-02-16
## 5 51195601 https://www.airbnb.com/users/show/51195601   Daniele 2015-12-13
## 6 28980052 https://www.airbnb.com/users/show/28980052  Philippe 2015-03-08
##                      host_location
## 1 Nantes, Pays de la Loire, France
## 2     Paris, Île-de-France, France
## 3     Paris, Île-de-France, France
## 4     Paris, Île-de-France, France
## 5            Prato, Toscana, Italy
## 6     Paris, Île-de-France, France
##                                                                 host_about
## 1                                                                         
## 2                                                                         
## 3                                                                         
## 4 Nous sommes un jeune couple vivant à Paris. Nous aimons beaucoup voyager
## 5                                                                         
## 6                                                                         
##   host_response_time host_response_rate host_acceptance_rate host_is_superhost
## 1                N/A                N/A                  N/A                 f
## 2                N/A                N/A                  N/A                 f
## 3     within an hour               100%                  N/A                 f
## 4       within a day                50%                  N/A                 f
## 5     within an hour               100%                  60%                 f
## 6                N/A                N/A                  N/A                 f
##                                                                                       host_thumbnail_url
## 1  https://a0.muscache.com/im/users/9703910/profile_pic/1383073563/original.jpg?aki_policy=profile_small
## 2 https://a1.muscache.com/im/users/35777602/profile_pic/1438688930/original.jpg?aki_policy=profile_small
## 3 https://a0.muscache.com/im/users/13945253/profile_pic/1396781528/original.jpg?aki_policy=profile_small
## 4  https://a1.muscache.com/im/users/5107123/profile_pic/1425849895/original.jpg?aki_policy=profile_small
## 5  https://a2.muscache.com/im/pictures/e984ba68-7571-46d9-99dc-735ec6e5c9d6.jpg?aki_policy=profile_small
## 6 https://a0.muscache.com/im/users/28980052/profile_pic/1425844331/original.jpg?aki_policy=profile_small
##                                                                                            host_picture_url
## 1  https://a0.muscache.com/im/users/9703910/profile_pic/1383073563/original.jpg?aki_policy=profile_x_medium
## 2 https://a1.muscache.com/im/users/35777602/profile_pic/1438688930/original.jpg?aki_policy=profile_x_medium
## 3 https://a0.muscache.com/im/users/13945253/profile_pic/1396781528/original.jpg?aki_policy=profile_x_medium
## 4  https://a1.muscache.com/im/users/5107123/profile_pic/1425849895/original.jpg?aki_policy=profile_x_medium
## 5  https://a2.muscache.com/im/pictures/e984ba68-7571-46d9-99dc-735ec6e5c9d6.jpg?aki_policy=profile_x_medium
## 6 https://a0.muscache.com/im/users/28980052/profile_pic/1425844331/original.jpg?aki_policy=profile_x_medium
##   host_neighbourhood host_listings_count host_total_listings_count
## 1        Batignolles                   1                         1
## 2     Champs-Elysées                   1                         1
## 3        Batignolles                   1                         1
## 4        Batignolles                   1                         1
## 5             Ternes                   1                         1
## 6        Batignolles                   1                         1
##                       host_verifications host_has_profile_pic
## 1          ['email', 'phone', 'reviews']                    t
## 2          ['email', 'phone', 'reviews']                    t
## 3          ['email', 'phone', 'reviews']                    t
## 4 ['email', 'phone', 'reviews', 'jumio']                    t
## 5 ['email', 'phone', 'reviews', 'jumio']                    t
## 6                     ['email', 'phone']                    t
##   host_identity_verified                                                street
## 1                      f      Rue Legendre, Paris, Île-de-France 75017, France
## 2                      f  Avenue Mac-Mahon, Paris, Île-de-France 75017, France
## 3                      f  Rue la Condamine, Paris, Île-de-France 75017, France
## 4                      t       Rue Gauthey, Paris, Île-de-France 75017, France
## 5                      t Avenue Brunetière, Paris, Île-de-France 75017, France
## 6                      f   Rue de Saussure, Paris, Île-de-France 75017, France
##    neighbourhood neighbourhood_cleansed neighbourhood_group_cleansed  city
## 1    Batignolles    Batignolles-Monceau                           NA Paris
## 2 Champs-Elysées    Batignolles-Monceau                           NA Paris
## 3    Batignolles    Batignolles-Monceau                           NA Paris
## 4    Batignolles    Batignolles-Monceau                           NA Paris
## 5         Ternes    Batignolles-Monceau                           NA Paris
## 6    Batignolles    Batignolles-Monceau                           NA Paris
##           state zipcode market smart_location country_code country latitude
## 1 Île-de-France   75017  Paris  Paris, France           FR  France 48.88880
## 2 Île-de-France   75017  Paris  Paris, France           FR  France 48.87664
## 3 Île-de-France   75017  Paris  Paris, France           FR  France 48.88384
## 4 Île-de-France   75017  Paris  Paris, France           FR  France 48.89236
## 5 Île-de-France   75017  Paris  Paris, France           FR  France 48.88942
## 6 Île-de-France   75017  Paris  Paris, France           FR  France 48.88707
##   longitude is_location_exact property_type       room_type accommodates
## 1  2.320466                 t     Apartment Entire home/apt            2
## 2  2.293724                 t     Apartment Entire home/apt            4
## 3  2.321031                 t     Apartment Entire home/apt            2
## 4  2.322338                 t     Apartment Entire home/apt            2
## 5  2.298321                 t     Apartment    Private room            2
## 6  2.312212                 t         House Entire home/apt            6
##   bathrooms bedrooms beds bed_type
## 1         1        1    1 Real Bed
## 2         2        2    3 Real Bed
## 3         1        1    1 Real Bed
## 4         1        1    1 Real Bed
## 5         1        1    1 Real Bed
## 6         3        4    4 Real Bed
##                                                                                                                                                       amenities
## 1                                                                          {TV,"Cable TV",Internet,"Wireless Internet",Kitchen,Heating,Washer,Dryer,Essentials}
## 2                                                       {"Wireless Internet",Kitchen,"Elevator in Building","Buzzer/Wireless Intercom",Washer,Dryer,Essentials}
## 3                                          {TV,Internet,"Wireless Internet",Kitchen,"Indoor Fireplace",Heating,"Family/Kid Friendly",Washer,Essentials,Shampoo}
## 4                                                                                                       {"Wireless Internet",Kitchen,Heating,Washer,Essentials}
## 5 {"Wireless Internet",Kitchen,"Smoking Allowed","Pets Allowed",Breakfast,"Elevator in Building",Heating,"Family/Kid Friendly",Washer,Dryer,Essentials,Shampoo}
## 6                          {TV,Internet,"Wireless Internet",Kitchen,Heating,"Family/Kid Friendly",Washer,Dryer,"Smoke Detector","Fire Extinguisher",Essentials}
##   square_feet   price weekly_price monthly_price security_deposit cleaning_fee
## 1          NA  $60.00      $388.00                        $200.00       $20.00
## 2          NA $200.00                                                         
## 3          NA  $80.00      $501.00     $1,503.00          $501.00             
## 4          NA  $60.00                                     $250.00             
## 5          NA  $50.00                                                         
## 6          NA $191.00                                                   $50.00
##   guests_included extra_people minimum_nights maximum_nights calendar_updated
## 1               1        $0.00              1           1125     5 months ago
## 2               1        $0.00              1           1125    11 months ago
## 3               1        $0.00              3           1125            today
## 4               0        $0.00              2           1125     8 months ago
## 5               1        $0.00              1             30      4 weeks ago
## 6               1        $0.00              3           1125     5 months ago
##   has_availability availability_30 availability_60 availability_90
## 1               NA               0               0               0
## 2               NA               0               0               0
## 3               NA               6              23              23
## 4               NA              29              59              89
## 5               NA              29              59              89
## 6               NA               0               0               0
##   availability_365 calendar_last_scraped number_of_reviews first_review
## 1                0            2016-07-03                 1   2015-05-19
## 2                0            2016-07-04                 0             
## 3              298            2016-07-04                 1   2015-10-10
## 4              364            2016-07-03                 1   2015-12-15
## 5               89            2016-07-04                 2   2016-06-17
## 6                0            2016-07-04                 0             
##   last_review review_scores_rating review_scores_accuracy
## 1  2015-05-19                  100                     10
## 2                               NA                     NA
## 3  2015-10-10                   80                     NA
## 4  2015-12-15                   80                      6
## 5  2016-06-17                  100                     10
## 6                               NA                     NA
##   review_scores_cleanliness review_scores_checkin review_scores_communication
## 1                        10                    10                          10
## 2                        NA                    NA                          NA
## 3                        NA                    NA                          NA
## 4                        10                     8                          10
## 5                        10                    10                          10
## 6                        NA                    NA                          NA
##   review_scores_location review_scores_value requires_license license
## 1                     10                  10                f        
## 2                     NA                  NA                f        
## 3                     NA                  NA                f        
## 4                      6                   8                f        
## 5                     10                  10                f        
## 6                     NA                  NA                f        
##   jurisdiction_names instant_bookable cancellation_policy
## 1              Paris                f            flexible
## 2              Paris                f            flexible
## 3              Paris                f            flexible
## 4              Paris                f            flexible
## 5              Paris                f            flexible
## 6              Paris                f            flexible
##   require_guest_profile_picture require_guest_phone_verification
## 1                             f                                f
## 2                             f                                f
## 3                             f                                f
## 4                             f                                f
## 5                             f                                f
## 6                             f                                f
##   calculated_host_listings_count reviews_per_month
## 1                              1              0.07
## 2                              1                NA
## 3                              1              0.11
## 4                              1              0.15
## 5                              1              2.00
## 6                              1                NA
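
Because head() prints every column, including the long free-text ones, the output above is hard to scan. As a hedged sketch of a more compact inspection, the example below uses a toy stand-in for the course file (the file name toy_airbnb.Rdata and the two-row data frame are assumptions made for illustration) to show that load() returns the names of the restored objects and that dim() and str() give a one-line-per-column overview.

```r
# Toy stand-in for the course file: save a small object named L to a temporary .Rdata file
L <- data.frame(id = c(4867396, 7704653), price = c("$60.00", "$200.00"))
tmp <- file.path(tempdir(), "toy_airbnb.Rdata")  # hypothetical file name
save(L, file = tmp)
rm(L)

# load() invisibly returns the names of the objects it restored
loaded <- load(tmp)
print(loaded)

# Compact overview: dimensions, then one line per column with its type
print(dim(L))
str(L)
```

On the real file, the same three calls (capturing the result of load(), then dim() and str() on the data frame) would list the available objects and column types without printing the long text fields.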

The following libraries must be installed, as they provide the tools used in this analysis.

library(dplyr)
## Warning: package 'dplyr' was built under R version 4.3.2
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(stringr) 
## Warning: package 'stringr' was built under R version 4.3.2
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 4.3.2
library(leaflet)
## Warning: package 'leaflet' was built under R version 4.3.2
library(ggridges)
## Warning: package 'ggridges' was built under R version 4.3.2

At this stage, we selected the variables that are most relevant to the objectives of the project. We therefore excluded the columns consisting of free-text descriptions. As a result, the number of variables was reduced from 95 to 37.

test2 <- select(test2, host_id, host_name, host_since, host_response_rate, host_acceptance_rate, host_is_superhost, host_total_listings_count, host_identity_verified, neighbourhood_cleansed, latitude, longitude, is_location_exact, property_type, room_type, accommodates, bathrooms, bedrooms, beds, bed_type, price, guests_included, extra_people, minimum_nights, maximum_nights, availability_30, availability_60, availability_90, availability_365, number_of_reviews, first_review, instant_bookable, cancellation_policy, require_guest_profile_picture, require_guest_phone_verification, calculated_host_listings_count, reviews_per_month, review_scores_rating)

dim(test2)
## [1] 52725    37

The following is a summary of our data set. Some variables have missing values, so they need to be transformed before any conclusions are drawn from the analysis. We also see zero values for variables that should not have any (bedrooms, bathrooms, etc.), so these need to be investigated and pre-processed too.

summary(test2)
##     host_id            host_name          host_since    host_response_rate
##  Min.   :    2626   Marie   :  583   2012-05-04:  166   100%   :26619     
##  1st Qu.: 6158190   Nicolas :  436   2012-06-18:  165   N/A    :12517     
##  Median :15885410   Pierre  :  418   2012-10-25:  155   90%    : 2524     
##  Mean   :22485601   Caroline:  388   2014-03-10:  135   80%    : 1567     
##  3rd Qu.:34348717   Anne    :  387   2015-07-29:  128   50%    :  949     
##  Max.   :81397049   Sophie  :  372   2013-07-20:  116   70%    :  676     
##                     (Other) :50141   (Other)   :51860   (Other): 7873     
##  host_acceptance_rate host_is_superhost host_total_listings_count
##  100%   :19680         :   46           Min.   :   0.00          
##  N/A    :15591        f:50513           1st Qu.:   1.00          
##  0%     : 1377        t: 2166           Median :   1.00          
##  50%    : 1292                          Mean   :   5.83          
##  67%    : 1149                          3rd Qu.:   2.00          
##  75%    :  915                          Max.   :1024.00          
##  (Other):12721                          NA's   :46               
##  host_identity_verified         neighbourhood_cleansed    latitude    
##   :   46                Buttes-Montmartre  : 6025      Min.   :48.81  
##  f:25730                Popincourt         : 4883      1st Qu.:48.85  
##  t:26949                Vaugirard          : 3878      Median :48.86  
##                         Batignolles-Monceau: 3603      Mean   :48.86  
##                         Entrepôt           : 3466      3rd Qu.:48.88  
##                         Passy              : 3074      Max.   :48.91  
##                         (Other)            :27796                     
##    longitude     is_location_exact         property_type  
##  Min.   :2.221   f: 7369           Apartment      :50663  
##  1st Qu.:2.323   t:45356           Loft           :  567  
##  Median :2.347                     House          :  537  
##  Mean   :2.344                     Bed & Breakfast:  394  
##  3rd Qu.:2.369                     Condominium    :  266  
##  Max.   :2.475                     Other          :  122  
##                                    (Other)        :  176  
##            room_type      accommodates      bathrooms       bedrooms     
##  Entire home/apt:45177   Min.   : 1.000   Min.   :0.00   Min.   : 0.000  
##  Private room   : 7001   1st Qu.: 2.000   1st Qu.:1.00   1st Qu.: 1.000  
##  Shared room    :  547   Median : 2.000   Median :1.00   Median : 1.000  
##                          Mean   : 3.051   Mean   :1.09   Mean   : 1.059  
##                          3rd Qu.: 4.000   3rd Qu.:1.00   3rd Qu.: 1.000  
##                          Max.   :16.000   Max.   :8.00   Max.   :10.000  
##                                           NA's   :243    NA's   :193     
##       beds                 bed_type         price       guests_included 
##  Min.   : 0.000   Airbed       :   35   $60.00 : 3055   Min.   : 0.000  
##  1st Qu.: 1.000   Couch        : 1182   $50.00 : 3047   1st Qu.: 1.000  
##  Median : 1.000   Futon        :  449   $70.00 : 2787   Median : 1.000  
##  Mean   : 1.684   Pull-out Sofa: 5066   $80.00 : 2598   Mean   : 1.353  
##  3rd Qu.: 2.000   Real Bed     :45993   $100.00: 2073   3rd Qu.: 2.000  
##  Max.   :16.000                         $90.00 : 2031   Max.   :16.000  
##  NA's   :80                             (Other):37134                   
##   extra_people   minimum_nights     maximum_nights      availability_30
##  $0.00  :37324   Min.   :   1.000   Min.   :1.000e+00   Min.   : 0.00  
##  $10.00 : 4453   1st Qu.:   1.000   1st Qu.:6.000e+01   1st Qu.: 0.00  
##  $20.00 : 2653   Median :   2.000   Median :1.125e+03   Median : 8.00  
##  $15.00 : 2469   Mean   :   3.128   Mean   :1.253e+05   Mean   :11.65  
##  $5.00  : 1179   3rd Qu.:   3.000   3rd Qu.:1.125e+03   3rd Qu.:23.00  
##  $30.00 :  989   Max.   :1000.000   Max.   :2.147e+09   Max.   :30.00  
##  (Other): 3658                                                         
##  availability_60 availability_90 availability_365 number_of_reviews
##  Min.   : 0.00   Min.   : 0.00   Min.   :  0.0    Min.   :  0.00   
##  1st Qu.: 2.00   1st Qu.: 6.00   1st Qu.: 22.0    1st Qu.:  0.00   
##  Median :26.00   Median :37.00   Median :183.0    Median :  3.00   
##  Mean   :27.33   Mean   :41.18   Mean   :179.5    Mean   : 12.59   
##  3rd Qu.:50.00   3rd Qu.:75.00   3rd Qu.:336.0    3rd Qu.: 13.00   
##  Max.   :60.00   Max.   :90.00   Max.   :365.0    Max.   :392.00   
##                                                                    
##      first_review   instant_bookable      cancellation_policy
##            :14508   f:44186          flexible       :19244   
##  2016-05-08:  212   t: 8539          moderate       :15039   
##  2016-06-13:  193                    strict         :18427   
##  2016-01-03:  186                    super_strict_30:    6   
##  2016-01-02:  183                    super_strict_60:    9   
##  2015-09-21:  173                                            
##  (Other)   :37270                                            
##  require_guest_profile_picture require_guest_phone_verification
##  f:51816                       f:51014                         
##  t:  909                       t: 1711                         
##                                                                
##                                                                
##                                                                
##                                                                
##                                                                
##  calculated_host_listings_count reviews_per_month review_scores_rating
##  Min.   :  1.000                Min.   : 0.010    Min.   : 20.00      
##  1st Qu.:  1.000                1st Qu.: 0.360    1st Qu.: 87.00      
##  Median :  1.000                Median : 0.900    Median : 93.00      
##  Mean   :  4.087                Mean   : 1.336    Mean   : 91.01      
##  3rd Qu.:  1.000                3rd Qu.: 1.870    3rd Qu.: 97.00      
##  Max.   :155.000                Max.   :14.290    Max.   :100.00      
##                                 NA's   :14508     NA's   :15454

Initially, there were 30524 missing values. After recoding empty strings and "N/A" entries as NA, the count rose to 73419.

sum(is.na(test2)) 
## [1] 30524
test2[test2 == ""] <- NA
test2[test2 == "N/A"] <- NA
sum(is.na(test2)) 
## [1] 73419
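
The jump in the count comes from placeholder strings that R does not treat as missing until they are recoded. A minimal sketch on a toy data frame (the values are hypothetical and not taken from the AirBnB data) shows the effect:

```r
# Toy data frame with hypothetical values; only one entry is a real NA.
toy <- data.frame(
  rate   = c("100%", "N/A", "90%", ""),
  review = c("2016-05-08", "", "2016-01-03", NA),
  stringsAsFactors = FALSE
)

na_before <- sum(is.na(toy))   # counts only the explicit NA

# Recode the two placeholder strings as proper missing values.
toy[toy == ""]    <- NA
toy[toy == "N/A"] <- NA

na_after <- sum(is.na(toy))    # placeholders are now counted as well
c(before = na_before, after = na_after)
```

In this toy case the count goes from 1 to 4, for the same reason the count in the report goes from 30524 to 73419.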

The following variables have the largest share of missing values (more than 1% of observations); these values should be replaced or omitted.

NAs_qty = colSums(is.na(test2))
NAs_prop = round(NAs_qty/nrow(test2), 3)
NAs.df <- data.frame(NAs_qty, NAs_prop)
NAs.df[NAs.df$NAs_prop > 0.01, ]
##                      NAs_qty NAs_prop
## host_response_rate     12563    0.238
## host_acceptance_rate   15637    0.297
## first_review           14508    0.275
## reviews_per_month      14508    0.275
## review_scores_rating   15454    0.293
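
Since the same missing-value table is computed more than once in this report, the steps could be wrapped in a small helper. The function name `na_report` and the toy data are assumptions made for illustration only:

```r
# Hypothetical helper: count and share of missing values per column,
# sorted from the most to the least affected column.
na_report <- function(df) {
  NAs_qty <- colSums(is.na(df))
  out <- data.frame(NAs_qty, NAs_prop = round(NAs_qty / nrow(df), 3))
  out[order(-out$NAs_qty), ]
}

# Toy data frame with hypothetical values.
toy <- data.frame(
  a = c(1, NA, 3, NA),
  b = c("x", "y", NA, "z"),
  c = 1:4
)

report <- na_report(toy)
report
```

The result can be filtered in the same way as above, for example `report[report$NAs_prop > 0.01, ]`.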

Some of those features need to be converted to another data type before they can be used properly. It is also worth noting that first_review is used to determine the visit frequency. For this reason, missing values of first_review were replaced with host_since for hosts who have just one apartment, as we consider this assumption reasonable for such hosts.

# Strip the percent sign, then convert the rate to a number
test2$host_response_rate <- as.numeric(sub("%", "", test2$host_response_rate, fixed = TRUE))
test2$host_acceptance_rate <- as.numeric(sub("%", "", test2$host_acceptance_rate, fixed = TRUE))

test2 <- test2 %>% mutate(first_review = case_when (
  is.na(first_review) & calculated_host_listings_count == 1  ~ host_since,
  .default = first_review ))

# Convert first_review to Date
test2 <- test2 %>%
  mutate(first_review = as.Date(as.character(first_review)))
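
The fallback rule for first_review can be checked on a toy data frame. This is only a sketch with hypothetical rows; it uses `TRUE ~` as the catch-all branch, which is an assumption made so that it also runs on dplyr versions without the `.default` argument:

```r
library(dplyr)

# Toy data frame with hypothetical hosts.
toy <- data.frame(
  host_since   = c("2012-05-04", "2013-07-20", "2014-03-10"),
  first_review = c(NA, NA, "2015-09-21"),
  calculated_host_listings_count = c(1, 3, 1),
  stringsAsFactors = FALSE
)

toy <- toy %>%
  mutate(
    # Fill the gap with host_since only for hosts with a single listing.
    first_review = case_when(
      is.na(first_review) & calculated_host_listings_count == 1 ~ host_since,
      TRUE ~ first_review
    ),
    first_review = as.Date(first_review)
  )

toy$first_review
```

The first host receives the host_since date, the second (three listings) stays missing, and the third keeps its recorded first review.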

The transformed features are shown below as histograms, since we need to assess their distributions before choosing a replacement value.

par(mfrow = c(2,2))
hist(test2$host_response_rate,breaks = 30, col="lavender", main = "Host response rate",xlab="host response rate")
hist(test2$host_acceptance_rate,breaks = 30, col="lavender", main = "Acceptance rate",xlab="host acceptance rate")
hist(test2$reviews_per_month,breaks = 30, col="lavender", main = "Reviews per month",xlab="reviews per month")
hist(test2$review_scores_rating ,breaks = 30, col="lavender", main = "Review scores rating",xlab="review scores rating")

According to these diagrams, the distributions are skewed, so the missing values can be replaced with the median, which is a more robust measure of central tendency than the mean for skewed data.

test2$host_response_rate[is.na(test2$host_response_rate)] <- median(test2$host_response_rate, na.rm = TRUE)
test2$host_acceptance_rate[is.na(test2$host_acceptance_rate)] <- median(test2$host_acceptance_rate, na.rm = TRUE)
test2$reviews_per_month[is.na(test2$reviews_per_month)] <- median(test2$reviews_per_month, na.rm = TRUE)
test2$review_scores_rating[is.na(test2$review_scores_rating)] <- median(test2$review_scores_rating, na.rm = TRUE)
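
To illustrate why the median is preferred here, the sketch below compares the mean and the median on a small right-skewed vector and fills the gap with the median. The helper name `impute_median` and the numbers are hypothetical:

```r
# Hypothetical helper: replace NA entries with the median of the observed values.
impute_median <- function(x) {
  x[is.na(x)] <- median(x, na.rm = TRUE)
  x
}

# Right-skewed toy vector: one extreme value pulls the mean upwards.
reviews <- c(0.1, 0.3, 0.5, 0.9, NA, 14.0)

mean_val   <- mean(reviews, na.rm = TRUE)    # 3.16, dominated by the extreme value
median_val <- median(reviews, na.rm = TRUE)  # 0.5, closer to a typical entry

filled <- impute_median(reviews)
filled
```

With the mean, the imputed entry would be larger than four of the five observed values; with the median it sits in the middle of them.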

We now check the table of missing values again to justify dropping the remaining incomplete rows. As can be seen below, the largest remaining share of missing values is about 5% (first_review), so these rows can be omitted.

NAs_qty = colSums(is.na(test2))
NAs_prop = round(NAs_qty/nrow(test2), 3)
NAs.df <- data.frame(NAs_qty, NAs_prop)
NAs.df[order(-NAs.df$NAs_prop),]
##                                  NAs_qty NAs_prop
## first_review                        2766    0.052
## bathrooms                            243    0.005
## bedrooms                             193    0.004
## beds                                  80    0.002
## host_name                             46    0.001
## host_since                            46    0.001
## host_is_superhost                     46    0.001
## host_total_listings_count             46    0.001
## host_identity_verified                46    0.001
## host_id                                0    0.000
## host_response_rate                     0    0.000
## host_acceptance_rate                   0    0.000
## neighbourhood_cleansed                 0    0.000
## latitude                               0    0.000
## longitude                              0    0.000
## is_location_exact                      0    0.000
## property_type                          3    0.000
## room_type                              0    0.000
## accommodates                           0    0.000
## bed_type                               0    0.000
## price                                  0    0.000
## guests_included                        0    0.000
## extra_people                           0    0.000
## minimum_nights                         0    0.000
## maximum_nights                         0    0.000
## availability_30                        0    0.000
## availability_60                        0    0.000
## availability_90                        0    0.000
## availability_365                       0    0.000
## number_of_reviews                      0    0.000
## instant_bookable                       0    0.000
## cancellation_policy                    0    0.000
## require_guest_profile_picture          0    0.000
## require_guest_phone_verification       0    0.000
## calculated_host_listings_count         0    0.000
## reviews_per_month                      0    0.000
## review_scores_rating                   0    0.000
test2 <- na.omit(test2)

It is also worth noting that price and extra_people were converted to numeric values. The pattern below keeps the whole-dollar part of each amount.

pattern <- "\\$(\\d+)"
test2$price <- as.numeric(str_match((str_replace_all(test2$price, ",", "")), pattern)[,2])
test2$extra_people <- as.numeric(str_match((str_replace_all(test2$extra_people, ",", "")), pattern)[,2])
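
The conversion can be tried on a few hypothetical price strings. The sketch also shows a base R alternative that keeps the cents; this alternative is an assumption added for comparison and is not used in the report:

```r
library(stringr)

# Hypothetical raw values in the same format as the price column.
raw_price <- c("$60.00", "$1,250.00", "$99.50")
pattern <- "\\$(\\d+)"

# Same steps as above: drop thousands separators, capture the digits after "$".
no_commas <- str_replace_all(raw_price, ",", "")
parsed <- as.numeric(str_match(no_commas, pattern)[, 2])
parsed        # whole dollars only: 60 1250 99

# Alternative (assumption): strip "$" and "," and keep the decimal part.
with_cents <- as.numeric(gsub("[$,]", "", raw_price))
with_cents    # 60 1250 99.5
```

Removing the commas first matters: without that step "$1,250.00" would be parsed as 1.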

As mentioned earlier, some variables contain zero values that are implausible and should be replaced. The following steps serve this purpose.

zero_qty = colSums(test2[,1:37]==0)
zero_prop = round(zero_qty/nrow(test2), 3)
zero.df <- data.frame(zero_qty, zero_prop)
zero.df <- arrange(zero.df,desc(zero_qty))
zero.df[zero.df$zero_qty != 0, ]
##                           zero_qty zero_prop
## extra_people                 34784     0.702
## availability_30              14080     0.284
## number_of_reviews            11659     0.235
## availability_60              11569     0.234
## availability_90              10445     0.211
## bedrooms                     10030     0.203
## availability_365              8255     0.167
## guests_included               2099     0.042
## host_acceptance_rate          1323     0.027
## bathrooms                       82     0.002
## host_total_listings_count        5     0.000
## price                            2     0.000
## beds                             1     0.000

The same replacement procedure is applied below: zero values of number_of_reviews and host_acceptance_rate are replaced with the median.

par(mfrow = c(1,2))
hist(test2$number_of_reviews ,breaks = 30, col="lavender", main = "Number of reviews",xlab="number of reviews")
hist(test2$host_acceptance_rate ,breaks = 30, col="lavender", main = "Host acceptance rate",xlab="host acceptance rate")

test2$number_of_reviews[test2$number_of_reviews == 0] <- median(test2$number_of_reviews, na.rm = TRUE)
test2$host_acceptance_rate[test2$host_acceptance_rate == 0] <- median(test2$host_acceptance_rate, na.rm = TRUE)

The bedrooms variable should be processed differently, so we count the observations with zero bedrooms separately for each number of beds. As can be seen, most of these observations fall into three main groups (1, 2 and 3 beds).

# Number of zero-bedroom listings for each number of beds
beds_zero_val <- select(test2, bedrooms, beds) %>%
  filter(bedrooms == 0) %>%
  count(beds, name = "beds_qty")
beds_zero_val
## # A tibble: 6 × 2
##    beds beds_qty
##   <int>    <int>
## 1     1     8131
## 2     2     1774
## 3     3      108
## 4     4       14
## 5     5        2
## 6     7        1

We then assumed that the number of bedrooms can be inferred from the number of beds, based on the relationship between the two variables.

test2 %>% 
  ggplot(aes(x=factor(bedrooms), y=beds, fill=factor(bedrooms))) +
  geom_boxplot(show.legend = FALSE) + ggtitle("The relationship between beds and bedrooms") + theme(
plot.title = element_text(color="black", size=14, face="bold.italic", hjust = 0.5))

Based on this box plot, the following conditional replacement was implemented.

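test2 <- test2 %>%
  mutate(bedrooms = ifelse(
    # zero bedrooms, one bed: median bedrooms of all one-bed listings
    bedrooms == 0 & beds == 1, median(test2$bedrooms[test2$beds == 1], na.rm = TRUE),
    ifelse(
      # zero bedrooms, two beds: median bedrooms of all two-bed listings
      bedrooms == 0 & beds == 2, median(test2$bedrooms[test2$beds == 2], na.rm = TRUE),
      # zero bedrooms, more than two beds: set to 3; otherwise keep the value
      ifelse(bedrooms == 0 & beds > 2, 3, test2$bedrooms)
    )
  ))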
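
The rule can be traced on a toy data frame. The rows below are hypothetical, and the medians are computed once before any replacement, with zeros included, to mirror the code above:

```r
# Toy data with hypothetical listings.
toy <- data.frame(
  bedrooms = c(0, 1, 1, 0, 2, 2, 0),
  beds     = c(1, 1, 1, 2, 2, 2, 4)
)

# Medians of bedrooms among one-bed and two-bed listings (zeros included).
med_1bed <- median(toy$bedrooms[toy$beds == 1])
med_2bed <- median(toy$bedrooms[toy$beds == 2])

# Replace zero bedrooms according to the number of beds; keep other values.
toy$bedrooms_fixed <- with(toy, ifelse(
  bedrooms == 0 & beds == 1, med_1bed,
  ifelse(bedrooms == 0 & beds == 2, med_2bed,
         ifelse(bedrooms == 0 & beds > 2, 3, bedrooms))
))

toy
```

Here the three zero-bedroom rows become 1, 2 and 3 bedrooms respectively, while rows that already had a positive value are unchanged.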

For the remaining features, the rows with zero values were simply filtered out, as they account for less than 1% of the observations.

test2 <- test2 %>% filter(bathrooms > 0)
test2 <- test2 %>% filter(host_total_listings_count > 0)
test2 <- test2 %>% filter(price > 0)
test2 <- test2 %>% filter(beds > 0)

As a result, we have 49431 observations, which is only about 6% fewer than the 52725 we had before these manipulations.

dim(test2)
## [1] 49431    37

Exploratory data analysis

1) Visit frequency of the different quarters according to time

All neighborhoods show broadly similar distributions of visit frequency over time. This suggests that location has little influence on tourists' choice.

for (i in unique(test2$neighbourhood_cleansed)){
  hist(test2$first_review[test2$neighbourhood_cleansed == i],breaks = "month", col="lavender", main = i,xlab="Date")
}

2) Number of apartments per owner

Using host_id, we calculated the number of apartments per owner. As a result, we have 44175 individual owners. The top 50 hosts of this list are presented below as a bar chart. Some host names occur more than once among the top 50, so their bars overlap on the same axis position.

appart_per_owner <- select(test2, host_id, host_name) %>% group_by(host_id) %>% mutate(Counts = n())
appart_per_owner <- appart_per_owner[!duplicated(appart_per_owner$host_id),] %>% 
  arrange(desc(Counts))

appart_per_owner  
## # A tibble: 44,175 × 3
## # Groups:   host_id [44,175]
##     host_id host_name         Counts
##       <int> <fct>              <int>
##  1  2288803 Fabien                76
##  2  3972699 Hanane                63
##  3  3943828 Caroline              60
##  4 12984381 Olivier               51
##  5 11593703 Rudy And Benjamin     47
##  6  7612270 Paul                  46
##  7  2667370 Parisian Home         43
##  8   152242 Delphine              41
##  9 13013633 Benjamin              40
## 10  3971743 Diane                 40
## # ℹ 44,165 more rows
ggplot(data=appart_per_owner[1:50,], aes(x=(reorder(host_name,  -Counts)),  y=Counts)) +
  geom_bar(stat="identity", color="black", fill="red")+
  geom_text(aes(label=Counts), vjust=-0.3, size=2.5) + xlab("owner's name") + ylab("apart_num") +
  theme(axis.text.x = element_text(angle=90, vjust=0.5, hjust=1)) + ggtitle("Number of apartments of top 50 hosts") + theme(plot.title = element_text(color="black", size=14, face="bold.italic", hjust = 0.5))
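
One possible way to keep hosts with identical names apart is to build the axis label from both the name and the id. This is a sketch on hypothetical rows; the `label` column is an assumption and is not part of the original analysis:

```r
library(dplyr)

# Toy listings: two different hosts are both called "Marie".
toy <- data.frame(
  host_id   = c(1, 1, 1, 2, 2, 3),
  host_name = c("Marie", "Marie", "Marie", "Paul", "Paul", "Marie")
)

per_owner <- toy %>%
  count(host_id, host_name, name = "Counts") %>%         # one row per host
  arrange(desc(Counts)) %>%
  mutate(label = paste0(host_name, " (", host_id, ")"))  # unique axis label

per_owner
```

Mapping `x = reorder(label, -Counts)` in the bar chart would then give each host its own bar.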

3) Relationship between prices and apartment features

To start with, a new feature log_price was added to the data set. Prices are strongly right-skewed, and the log transformation makes their distribution more symmetric for the analysis that follows.

test2$log_price <- log(test2$price)
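
The effect of the transformation can be seen on a handful of hypothetical prices. The function `skew_gap` is a crude indicator invented for this sketch (distance between mean and median relative to the range), not a standard statistic:

```r
# Hypothetical prices with one very expensive listing.
price <- c(30, 50, 60, 80, 100, 1000)
log_price <- log(price)

# Crude skew indicator (assumption): how far the mean sits from the median,
# relative to the spread of the data.
skew_gap <- function(x) (mean(x) - median(x)) / diff(range(x))

gap_raw <- skew_gap(price)
gap_log <- skew_gap(log_price)
c(raw = gap_raw, log = gap_log)

# The transformation is reversible for positive prices.
all.equal(exp(log_price), price)
```

The gap is smaller on the log scale, which is the motivation for plotting log_price. Note that `log(0)` is `-Inf`, which is one reason zero prices have to be removed beforehand.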

The box plots below show the relationships between prices and apartment features. In general, we see an increasing trend: prices rise with the number of beds, bedrooms and bathrooms. However, the number of bathrooms has a positive effect only up to 4.5; beyond that, a downward trend prevails.

ggplot(data = test2) +
  geom_boxplot(aes(x=factor(beds),y=log_price, fill=factor(beds))) + xlab("beds") + ggtitle("The relationship between prices and apartment feature") + theme(
plot.title = element_text(color="black", size=14, face="bold.italic", hjust = 0.5)) + guides(fill = guide_legend(title = "Bed"))

ggplot(data = test2) +
  geom_boxplot(aes(x=factor(bedrooms),y=log_price, fill=factor(bedrooms))) + xlab("bedrooms") + ggtitle("The relationship between prices and apartment feature") + theme(
plot.title = element_text(color="black", size=14, face="bold.italic", hjust = 0.5)) + guides(fill = guide_legend(title = "Bedroom"))

ggplot(data = test2) +
  geom_boxplot(aes(x=factor(bathrooms),y=log_price, fill=factor(bathrooms))) + xlab("bathrooms") + ggtitle("The relationship between prices and apartment feature") + theme(
plot.title = element_text(color="black", size=14, face="bold.italic", hjust = 0.5)) + guides(fill = guide_legend(title = "Bathroom"))

The same relationships were also examined per neighborhood, and the same trend can be observed in all of them.

ggplot(data = test2) +
  geom_boxplot(aes(x=factor(beds),y=log_price)) +
  facet_wrap(~ neighbourhood_cleansed) +
  theme_minimal(base_size=8.5) + xlab("beds")

ggplot(data = test2) +
  geom_boxplot(aes(x=factor(bedrooms),y=log_price)) +
  facet_wrap(~ neighbourhood_cleansed) +
  theme_minimal(base_size=13) + xlab("bedrooms")

ggplot(data = test2) +
  geom_boxplot(aes(x=factor(bathrooms),y=log_price)) +
  facet_wrap(~ neighbourhood_cleansed)+
  theme_minimal(base_size=8) +
  theme(axis.text.x = element_text(angle=90, vjust=0.5)) + xlab("bathrooms")

The last diagram highlights the quarters with the most expensive apartments relative to the main features (these include Élysée, Passy, Palais-Bourbon, Luxembourg, etc.).

ggplot(data = test2) +
  geom_point(aes(x=bedrooms,y=bathrooms,size=price, col=log_price)) + scale_colour_gradient(low = "white", high = "black") +
  facet_wrap(~ neighbourhood_cleansed, nrow=3)

4) Renting price per city quarter (“arrondissements”)

The mean price was calculated for each quarter. In general, the results support the previous statement about the most expensive neighborhoods.

# Mean price per neighbourhood
neighbor_price <- test2 %>%
  group_by(neighbourhood_cleansed) %>%
  summarise(Average_price = mean(price)) %>%
  rename(Neighbourhood = neighbourhood_cleansed)
neighbor_price[order(-neighbor_price$Average_price),]
## # A tibble: 20 × 2
##    Neighbourhood       Average_price
##    <fct>                       <dbl>
##  1 Élysée                      154. 
##  2 Luxembourg                  141. 
##  3 Louvre                      136. 
##  4 Palais-Bourbon              134. 
##  5 Hôtel-de-Ville              128. 
##  6 Passy                       118. 
##  7 Bourse                      117. 
##  8 Temple                      115. 
##  9 Panthéon                    112. 
## 10 Opéra                        96.7
## 11 Vaugirard                    89.5
## 12 Batignolles-Monceau          87.6
## 13 Observatoire                 81.6
## 14 Entrepôt                     81.0
## 15 Popincourt                   78.6
## 16 Reuilly                      76.4
## 17 Buttes-Montmartre            73.8
## 18 Gobelins                     72.5
## 19 Buttes-Chaumont              66.1
## 20 Ménilmontant                 65.9
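
The grouped average can be reproduced on a toy data frame. The sketch below also adds the median per group, which is an assumption included only because the price distribution is skewed; the report itself uses the mean:

```r
library(dplyr)

# Toy data with two hypothetical neighbourhoods.
toy <- data.frame(
  neighbourhood_cleansed = c("A", "A", "B", "B", "B"),
  price = c(100, 200, 50, 70, 90)
)

toy_price <- toy %>%
  group_by(neighbourhood_cleansed) %>%
  summarise(
    Average_price = mean(price),    # as in the report
    Median_price  = median(price)   # added for comparison (assumption)
  ) %>%
  arrange(desc(Average_price))

toy_price
```

On real data, a large difference between the two columns for a quarter would point to a few very expensive listings driving its average.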

The following diagrams show how prices vary between neighborhoods; the most expensive quarters can be easily identified.

ggplot(test2, aes(x=neighbourhood_cleansed, y=log_price, fill=neighbourhood_cleansed)) +
  geom_boxplot(show.legend = FALSE) + coord_flip()  + xlab("neighborhood")

ggplot(test2, aes(x=neighbourhood_cleansed, y=log_price, fill=neighbourhood_cleansed)) + 
  geom_violin(trim=FALSE)  + xlab("neighbourhood") +
  ggtitle("Renting price per city quarter") + theme(plot.title = element_text(color="black", size=14, face="bold.italic", hjust = 0.5)) +
        theme(axis.text.x = element_text(angle=90, hjust=1)) + geom_boxplot(width=0.1)

ggplot(data = test2, mapping = aes(x = price, y = neighbourhood_cleansed)) +
    geom_density_ridges(mapping = aes(fill = neighbourhood_cleansed), bandwidth = 130, alpha = .6, size = 1) +
  theme_ridges()  +
  xlab("Price") +
    ylab("") +
    ggtitle("Price behavior ") + xlim(-250,2000) + guides(fill = guide_legend(title = "Neighborhood"))
## Warning: Removed 5 rows containing non-finite values (`stat_density_ridges()`).

It was also interesting to show the price range on a map (using the leaflet package) to observe the difference between neighborhoods. For this purpose, we created a new feature price_group. The map suggests that the closer a listing is to the center, the more expensive it gets.

# price >= 50 so that a price of exactly 50 is "Moderate" rather than falling through to "High"
test2 <- test2 %>% mutate(price_group = ifelse(price < 50, "Low", ifelse(price >= 50 & price < 100, "Moderate", "High")))
pal <- colorFactor(palette = c("red", "green", "blue"), domain = test2$price_group)

leaflet(data = test2) %>%
  addProviderTiles(providers$CartoDB.Positron) %>%
  addCircleMarkers(~longitude, ~latitude, color = ~pal(price_group), weight = 1, radius = 1,
                   fillOpacity = 0.1, opacity = 0.1,
                   label = paste("Neighbourhood:", test2$neighbourhood_cleansed)) %>%
  addLegend("bottomright", pal = pal, values = ~price_group,
            title = "Price groups",
            opacity = 1)
#save(test2,file='Airbnb_cleansed.Rdata')
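
An alternative way to define price bands is `cut()`, which makes the boundaries explicit. This is only a sketch on hypothetical prices; the choice of left-closed intervals (50 counts as "Moderate", 100 as "High") is an assumption:

```r
# Hypothetical prices, including the boundary values 50 and 100.
price <- c(20, 49, 50, 75, 99, 100, 400)

# right = FALSE gives the intervals [-Inf, 50), [50, 100), [100, Inf).
price_group <- cut(
  price,
  breaks = c(-Inf, 50, 100, Inf),
  labels = c("Low", "Moderate", "High"),
  right  = FALSE
)

data.frame(price, price_group)
```

Because every value falls into exactly one interval, no price can slip into the wrong band at a boundary, and the factor levels keep the order Low, Moderate, High.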